Skip to content

feat(#1425): strict_provenance config flag for runtime enforcement#1474

Open
dimitri-yatsenko wants to merge 1 commit into
feat/1424-self-upstreamfrom
feat/1425-strict-provenance
Open

feat(#1425): strict_provenance config flag for runtime enforcement#1474
dimitri-yatsenko wants to merge 1 commit into
feat/1424-self-upstreamfrom
feat/1425-strict-provenance

Conversation

@dimitri-yatsenko

Copy link
Copy Markdown
Member

Summary

T2.2.c of the provenance trinity — completes the trio.

When `dj.config["strict_provenance"] = True`, runtime gates enforce the upstream-only convention inside `make()`:

  • Reads must target a table in the active trace's allowed set (declared ancestors + self + self's Parts).
  • Writes must target self or self's Parts.
  • Inserted rows' PK columns that overlap with the current key must equal the key's values.

Default is False. Existing `make()` bodies are unaffected.

Closes #1425. Slated for DataJoint 2.3.

Branch stack

Layer PR Branch
T1.4 cascade fix #1468 `fix/1429-cascade-part-part-renamed-fk`
T2.2.a `Diagram.trace()` #1471 `feat/1423-diagram-trace`
T2.2.b `self.upstream` #1473 `feat/1424-self-upstream`
T2.2.c `strict_provenance` (this) this PR `feat/1425-strict-provenance`

Will rebase onto master after the chain merges in order.

What's added

Component File
Runtime context module — `ContextVar` + push/pop helpers + read/write gates `src/datajoint/provenance.py` (new)
Config flag `strict_provenance: bool` (default False) + env-var `DJ_STRICT_PROVENANCE` `src/datajoint/settings.py`
Push/pop context in `_populate_one` around the `make()` invocation `src/datajoint/autopopulate.py`
Read gate in `QueryExpression.cursor` `src/datajoint/expression.py`
Write gate in `Table.insert` (after existing `_allow_insert`) `src/datajoint/table.py`
6 integration tests `tests/integration/test_strict_provenance.py` (new)

Implementation notes

  • ContextVar (not threading.local) so the active-make context propagates correctly across `contextvars`-aware boundaries (asyncio, multiprocessing-with-context-propagation).
  • Part-table detection uses class `dict` traversal filtered to `Part` subclasses — `dir/getattr` would trigger the `_JobsDescriptor` and try to lazy-declare `~~table` inside the populate transaction (caught on the first test).
  • Read gate is no-op outside make() — checks `ContextVar.get()` returns None.
  • Write gate is no-op outside make() — same.

Documented limitation (deferred)

The read gate does not distinguish reads that came through `self.upstream` from reads of the same ancestor via a direct expression. Both are allowed if the table is in the allowed set. The intent is to catch reads from undeclared dependencies; tightening the "must come through self.upstream" path requires propagating an attribution marker through QueryExpression composition and is left for a follow-up.

Tests

  • 6/6 new strict-mode tests pass.
  • 17/17 existing autopopulate tests pass with strict default-off — confirms no regression.
  • 8/8 trace + 9/9 cascade tests unaffected.

Test plan

  • 6/6 new tests
  • 17/17 default-off regression
  • CI green after rebase onto master
  • Manual: enable in a staging pipeline and verify legitimate `make()`s still run while violations raise with clear messages

Implements T2.2.c of the provenance trinity, completing the trio
(Diagram.trace → self.upstream → strict_provenance).

When dj.config["strict_provenance"] = True, runtime gates enforce the
upstream-only convention inside make():
- Reads must target a table in the active trace's allowed set
  (declared ancestors + self + self's Parts).
- Writes must target self or self's Parts.
- Inserted rows' PK columns that overlap with the current key must
  equal the key's values (key-consistency rule).

Default is False. Existing make() bodies are unaffected.

Branch stacked on feat/1424-self-upstream (#1473) → feat/1423-diagram-trace
(#1471) → fix/1429-cascade-part-part-renamed-fk (#1468). Will rebase
onto master after the chain merges.

What's added:

- src/datajoint/provenance.py (new): the runtime context module.
  - `_active_strict_make` ContextVar holding (target, allowed_tables,
    key) for the currently-executing make() invocation. ContextVar
    chosen over threading.local to propagate correctly across
    contextvars-aware concurrency boundaries.
  - `push_strict_make_context` / `pop_strict_make_context` — context
    lifecycle managed by `_populate_one`'s try/finally.
  - `assert_read_allowed(query_expression)` — read gate. Recursively
    discovers base tables via the QueryExpression's `_support` chain
    and checks each against the allowed set.
  - `assert_write_allowed(target_table, rows)` — write gate. Verifies
    the target is self or one of self's Part tables, and checks the
    key-consistency rule on each dict row.

- src/datajoint/settings.py: new `strict_provenance: bool` field on
  Config (default False), env-var `DJ_STRICT_PROVENANCE`, ENV_VAR_MAPPING
  entry.

- src/datajoint/autopopulate.py: in `_populate_one`, push the strict
  context (when the flag is on) just before the make() invocation
  block. The allowed table set = trace's ancestor nodes ∪ {self.full_table_name}
  ∪ {self's Parts}. Pop in the existing `finally` block.

- src/datajoint/expression.py: `QueryExpression.cursor` now calls
  `assert_read_allowed(self)` before issuing SQL. No-op outside make().

- src/datajoint/table.py: `Table.insert` calls `assert_write_allowed(self, rows)`
  after the existing `_allow_insert` check. No-op outside make().

Part-table detection uses class `__dict__` traversal (filtered to Part
subclasses) instead of `dir/getattr` to avoid triggering the
`_JobsDescriptor` (which would lazy-declare ~~table inside the populate
transaction — caught by the first test iteration).

Documented limitation (deferred): the read gate does not distinguish
reads that came through `self.upstream` from reads of the same ancestor
via a direct expression. Both are allowed if the table is in the
allowed set. The intent is to catch reads from *undeclared*
dependencies; tightening the "must come through self.upstream" path
requires propagating an attribution marker through QueryExpression
composition and is left for a follow-up release.

Tests in tests/integration/test_strict_provenance.py (6 new):

- test_strict_compliant_make_passes — make() reading via self.upstream
  and writing self.insert1 with matching key runs cleanly under strict.
- test_strict_blocks_read_from_undeclared_table — read from an unrelated
  table raises with "strict_provenance ... undeclared" message.
- test_strict_blocks_write_to_other_table — insert into a non-self,
  non-Part target raises "not permitted".
- test_strict_blocks_write_with_mismatched_key — row PK that disagrees
  with the current key raises "does not match the current make() key".
- test_strict_writes_to_part_table_pass — self.PartName.insert(...) works.
- test_strict_off_by_default_no_change — default-off regression check;
  the canonical "direct (Ancestor & key).fetch1()" pattern still works
  when strict_provenance is unset.

Regression: 17/17 autopopulate tests pass with strict_provenance unset
(default). 6/6 new strict tests pass with strict_provenance=True.
8/8 trace tests + 9/9 cascade tests unaffected.

Slated for DataJoint 2.3.
@dimitri-yatsenko dimitri-yatsenko force-pushed the feat/1425-strict-provenance branch from 9aa7784 to f60495b Compare June 23, 2026 13:23
@dimitri-yatsenko dimitri-yatsenko force-pushed the feat/1424-self-upstream branch from 86b1ae7 to 7e4130f Compare June 23, 2026 13:23
@dimitri-yatsenko dimitri-yatsenko marked this pull request as ready for review June 23, 2026 13:39
@ttngu207

ttngu207 commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

A note on adoption: two patterns strict mode will break, and what that tells us we're missing

I've spent time with this PR and the spec, and I want to say up front: the provenance trinity is a real evolution for DataJoint. strict_provenance is exactly the right mechanism to turn "reads should come from declared ancestors" from a convention we hope people follow into a guarantee the framework actually enforces. None of what follows is a blocker, and none of it is a criticism of the design.

What I want to do here is think through adoption. There are a couple of patterns that are fairly commonly used in practice today that this feature will break. Both of them are, honestly, bad patterns — they're provenance bugs, and strict mode is right to catch them. But they're bad patterns that a lot of working pipelines lean on, and I think if we turn this on without a story for how those pipelines migrate, we take DataJoint to the next level and leave a chunk of our users behind. So this is really an argument for the next thing we should build, using these two patterns to show why.

Pattern 1 — reading "sideways" to a table that isn't a declared ancestor

The common case looks like this: a computation needs a piece of reference information that lives off to the side of its own lineage. For example, SpikeDetection depends on Recording, but to set a baseline it wants the brain region — which it gets by walking up to Subject and then back down a different branch to Surgery → BrainRegion:

class SpikeDetection(dj.Computed):
    definition = "-> Recording"
    def make(self, key):
        signal = (Recording & key).fetch1("signal")          # declared ancestor — fine
        region = (BrainRegion & (Surgery & key)).fetch1("region")  # NOT an ancestor
        baseline = BASELINES[region]
        ...

BrainRegion isn't upstream of SpikeDetection — it's a cousin. Under strict mode this read is blocked, and that's correct: if the region's baseline feeds the result, then BrainRegion genuinely is a dependency, and today nothing records that. Edit a BrainRegion row and every SpikeDetection derived from it is silently stale, with no way to know. So strict mode is doing exactly its job — turning a silent, invisible dependency into a loud, declared one.

Worth noting the boundary: the very common "parameter table" case (-> AnalysisParams) is already a declared ancestor, so it keeps working. Only genuinely undeclared sideways reads break. This one has a reasonably clear fix — make the dependency real by declaring it — so I'll spend most of the space on the second pattern, which is harder.

Pattern 2 — one make() that ingests a file and writes to many tables

This one is about using DataJoint's populate machinery (job reservation, error tracking, triggering on new data) as an orchestration layer. A very common shape: a table triggers when a new file appears, and its make() parses that file — which usually contains nested, hierarchical information — and fans the results out into a bunch of normalized entity tables:

class FileIngestion(dj.Computed):
    definition = "-> FileManifest"
    def make(self, key):
        data = parse(fetch_file(key))
        Session.insert(data["sessions"])   # writing to tables that aren't self
        Block.insert(data["blocks"])
        Trial.insert(data["trials"])
        Event.insert(data["events"])

Strict mode blocks this, because a make() may only write to self and its Parts. And again it's right to: those Session/Trial/Event rows land with no foreign-key edge back to the file they came from, so there's no way to trace an entity to its source. The traceability we care about is genuinely broken here.

The obvious suggestion is "make those tables Parts of FileIngestion." That doesn't work, and the reason is instructive: Session, Trial, and Event are top-level entities that other pipeline branches depend on (Recording -> Session, SpikeSorting -> Trial, and so on). Turning them into Parts of an ingestion table would force all those downstream branches to reference a Part from outside its master — the exact thing the master/Part convention tells us not to do. So Parts are structurally the wrong tool here, not just semantically awkward.

The hard part: for pattern 2 there's no clean replacement, and here's why

When I work through the alternatives for orchestrated ingestion, every strict-compliant option gives up something real. There are three:

Option A — the "landing-table" approach. Restructure so the ingestion is a real upstream table: ParsedFile becomes an ancestor, and Session, Trial, etc. each depend on it and get their own single-target make() that reads the already-parsed data and writes only to itself. This is strict-compliant and provenance is intact. But it means your entities are now downstream of a file. The pipeline reads as "everything begins with a data source" rather than "everything begins with a subject and a session." It's input-driven design, and I don't think it reflects the truth of the model.

Option B — keep the entities as roots, orchestrate outside DataJoint. Session etc. stay top-level tables, and some external system (a script, Airflow, whatever) does the file-watching and inserts them. The domain model stays clean — but you lose the DataJoint populate machinery you wanted in the first place, and the file→entity link now lives entirely outside DataJoint, invisible to it.

Option C — the fat make() above. Clean domain model, keeps orchestration — but it's the pattern strict mode exists to forbid, precisely because it severs provenance.

Laid out together:

Durable domain model Records where the data came from Stable when ingestion changes Keeps DataJoint orchestration
A. Landing table (file is upstream) ✗ input-driven ~ but as a fake dependency ✗ new source ⇒ schema change
B. Roots + external orchestration ~ but outside DataJoint
C. Fat make() (what strict mode blocks) ✗ severed

Every row loses at least one thing that matters.

Why the "landing table" option is worse than it first looks

Option A is the one people will reach for, so it's worth being clear about its deepest problem: it bakes a transient operational fact into the durable data model.

How an entity physically arrives is not durable. A Session might come from a file today, an API call next quarter, a direct database sync the year after. But Session belongs to Subject — that's the real, invariant truth of the model, and it holds no matter how the data got there. These two facts live on completely different timescales: the domain relationships change on the scale of scientific redesign (years), while ingestion mechanics change on the scale of infrastructure (months).

The foreign-key graph is supposed to encode the durable, invariant relationships. The moment you put "came from this file" into it, you've married the slow-moving graph to the fast-moving one. Every time ingestion changes, you're doing a schema migration; and your historical rows permanently carry whatever ingestion ancestor happened to exist when they were created. So the principled strict-mode option is actually the least stable one over the life of a pipeline — enforcing provenance this way makes the schema churn with your infrastructure. That's a genuinely counterintuitive result, and I think it's the crux.

Off versus on, stated carefully

Here's the asymmetry that worries me. With strict_provenance off, a workable option exists — the fat make() — that keeps a clean, durable domain model and keeps DataJoint orchestration. Its only casualty is provenance, which the framework wasn't guaranteeing anyway. With it on, that option is gone, and there's no remaining option that records the file→entity relationship truthfully — as "how it arrived," rather than as a fabricated (and non-durable) dependency in the FK graph — while keeping orchestration. So turning strict mode on doesn't just close a loophole; for this class of pipeline it takes away the one option whose only cost was something already soft, and leaves you choosing between corrupting the durable model (A) or giving up the framework's orchestration (B).

What this actually points to — two things worth building

I don't think the takeaway is "don't ship strict mode." I think it's that strict mode is the feature that makes two follow-on capabilities necessary — it's the alarm, and these are the fire exits:

  1. Provenance-first schema evolution. The reason both patterns are so painful to fix is that the remediation is always "restructure your schema," and we have no tooling for that — so people fall back on drop-and-repopulate, which destroys history. A proper schema-evolution tool would use the trinity's own primitives — trace to find what's upstream of a change, cascade to find what it invalidates — to change dependency structure and recompute safely. It's not alembic; it's evolution that manages the invalidation cascade. And it's built from exactly what this PR ships, which is what makes it a natural next step rather than a detour.

  2. A first-class notion of where data came from, kept separate from the FK graph. Everything above keeps circling one gap: DataJoint has one graph, and it uses it for both "what is derived from what" (durable, invariant) and, when forced, "how this row arrived" (transient, operational). Those need to be different things. Ingestion lineage should be recorded — append-only, allowing many sources over time — so a Session can carry a "came from file X" record today and an "came from API Y" record later, with no schema change at all. The FK graph structurally can't do that, and forcing it to (option A) is exactly what corrupts the model.

Neither of these is a 2.3 concern, and I'm not suggesting they gate this release. But I'd love for us to decide the sequencing deliberately, so we're not shipping the alarm well ahead of the exits. Happy to write these up properly as their own proposals if that's useful.

@ttngu207 ttngu207 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a close read of the enforcement path and want to flag three correctness issues before this ships, so I'm requesting changes. To be clear up front: the scaffolding is right — the ContextVar push/pop with token reset, the exception-path cleanup, transaction rollback, Part detection via __dict__ (nicely avoiding the _JobsDescriptor), and the flag-off no-op are all correct. The problems are in the two gate functions themselves. None of this is about the broader strict-mode design conversation; it's purely implementation correctness.

1. [blocking] Write gate can silently drop all rows — data loss on compliant code

assert_write_allowed(self, rows) runs in Table.insert (table.py:804) before insert materializes rows, and when a strict context is active its non-dict branch does for row in rows. For a one-shot iterable (generator, iter(...), map(...)), that exhausts it; the post-gate consumer — _insert_rowslist(… for row in rows) at table.py:864, or the chunked iter(rows) path at :836 — then sees an empty iterator and inserts zero rows, with no error:

def make(self, key):
    self.insert(dict(**key, x=i) for i in range(n))   # strict on → inserts nothing, silently

Two precisions, because they make this a genuine blocker rather than an edge case:

  • It's scoped to one-shot iterables. A list/tuple of dicts is safe — it's re-iterable, so the gate walks it and insert re-walks it fine. So self.insert([...]) is unaffected; only self.insert(<generator>) breaks. But passing a generator to insert() is a documented, idiomatic pattern (the chunked-insert path is literally built around iter(rows)), so this isn't exotic input.
  • It strikes compliant code. The target check passes for a normal self.insert(<generator>) into the table being populated — the gate then consumes the generator anyway. So a correct, strict-compliant make() that inserts many computed rows (very common for Part tables) silently loses all of them. This isn't "you did something forbidden and got a confusing failure"; it's "you wrote correct code and it discarded your data."

Fix direction: don't consume rows in the gate — do the target-only check first (it doesn't need the rows), and defer the key-consistency check to where insert has already materialized rows into a concrete list.

Separately, insert(some_query_expression) isn't data loss but is double-executed: the gate iterates it (running the source SELECT client-side and firing the read gate on it), then insert re-runs it as INSERT … SELECT. Worth fixing in the same pass.

2. [blocking] Read gate misses len() / bool() / in

Only QueryExpression.cursor calls assert_read_allowed. But __len__ (expression.py:1149), __bool__ (:1178), and __contains__ (:1195, via bool(self & item)) each issue connection.query(...) directly and never touch cursor. So inside make():

len(Unrelated & key)     # ungated
bool(Unrelated & cond)   # ungated
key in Unrelated         # ungated (→ __bool__)

all read a forbidden table with no check. Aggregation.__len__/__bool__ and Union.__len__/__bool__ have the same direct-query bypass.

3. [blocking] Read gate is blind to restriction-by-table

_base_tables (provenance.py:56) only walks full_table_name and _support. A semijoin restriction Ancestor & Unrelated renders Unrelated as an IN (subquery) in the WHERE clause — it never enters _support — so _base_tables returns {Ancestor} and the read of Unrelated passes. (Joins are caught, because join extends _support; restrictions are not, and restrict-by-table is at least as common.)

Net on 2 + 3: the gate catches UndeclaredTable.fetch() but not the everyday idioms (restrict-by, existence check, count), so an undeclared dependency stays readable through several common paths. Since the whole value here is a guarantee, I think these need to close before merge — a gate that blocks the naive case while passing the common ones invites false confidence, which is worse than an honest "best-effort."

Tests

The 6 tests exercise the naive paths and would all pass with the three bugs above present. Worth adding: join-with-a-disallowed-operand (the one read path that actually works — untested, so a regression would be invisible), restriction-by-table, len/in/bool, insert-from-query, generator rows, reentrant make() context restore, and exception-path context cleanup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants